Constructing Resiliant Communication Infrastructure for Runtime Environments

Authors

  • George Bosilca
  • Camille Coti
  • Thomas Hérault
  • Pierre Lemarinier
  • Jack J. Dongarra
Abstract

Next-generation HPC platforms are expected to feature millions of cores distributed over hundreds of thousands of nodes, leading to scalability and fault-tolerance issues for both applications and the runtime environments dedicated to running on such machines. Most parallel applications are developed using a communication API such as MPI, implemented in a library that runs on top of a dedicated runtime environment. Considerable effort has gone into improving performance, scalability and fault tolerance at the library level over the past decades. The most recent techniques propose to deal with failures locally, to avoid stopping and restarting the whole system. As a consequence, fault tolerance becomes a critical property of the runtime environment. A runtime environment is a service of a parallel system that starts and monitors applications. It is deployed on the parallel system by a launching service, usually following a spanning tree to improve the scalability of the deployment. The first task of the runtime environment is then to build its own communication infrastructure to synchronize the tasks of the parallel application. A fault-tolerant runtime environment must detect failures and coordinate with the application to recover from them. Communication infrastructures in use today (e.g. trees and rings) are usually built in a centralized way and fail to support fault tolerance, because a few failures lead, with high probability, to disconnected components. Previous work [2] has demonstrated that the Binomial Graph topology (BMG) is a good candidate as a communication infrastructure supporting both scalability and fault tolerance for runtime environments. Roughly speaking, in a BMG, each process is the root of a binomial tree gathering all the processes.
In this paper, we present and analyze a self-stabilizing algorithm to transform the underlying communication infrastructure provided by the launching service into a BMG, and to maintain it in spite of failures. We demonstrate that this algorithm is scalable, tolerates transient failures, and adapts itself to topology changes.
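
To make the BMG description more concrete, the following is a minimal sketch of the topology itself, not of the self-stabilizing algorithm the paper presents. It assumes the commonly used definition in which process `rank` is linked to `(rank ± 2^k) mod size` for every power of two smaller than `size`; that definition is not stated in this abstract, and the function names `bmg_neighbors` and `survivors_connected` are hypothetical.

```python
from collections import deque


def bmg_neighbors(rank: int, size: int) -> list[int]:
    """Neighbors of `rank` in a binomial graph of `size` processes.

    Assumed definition (not taken from the abstract): links go to
    (rank + 2^k) mod size and (rank - 2^k) mod size for all 2^k < size.
    """
    neighbors = set()
    step = 1
    while step < size:
        neighbors.add((rank + step) % size)
        neighbors.add((rank - step) % size)
        step *= 2
    neighbors.discard(rank)  # guard against a self-link in tiny graphs
    return sorted(neighbors)


def survivors_connected(size: int, failed: set[int]) -> bool:
    """Check by breadth-first search whether the surviving processes
    still form a single connected component of the BMG."""
    alive = [r for r in range(size) if r not in failed]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        node = queue.popleft()
        for peer in bmg_neighbors(node, size):
            if peer not in failed and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return len(seen) == len(alive)
```

Under these assumptions, with 8 processes, losing processes 1 and 7 would isolate process 0 on a ring, whereas `survivors_connected(8, {1, 7})` still reports a connected graph because process 0 keeps its links to 2, 4 and 6.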

Similar resources

The application of MDA in distributed services of Run-time Infrastructure

This paper proposes the application of Model Driven Architecture (MDA) for distributed services Run-time Infrastructure to achieve reusing simulation services and communicating in heterogeneous network environments. Platform-Independent Model (PIM) is built based on simulation services definition in service-oriented distributed Runtime Infrastructure by interface definition language Slice to de...

Optimizing Byzantine Consensus for Fault-Tolerant Embedded Systems with Ad-Hoc and Infrastructure Networks

Consensus algorithms are an important building block for fault-tolerant distributed systems. This paper investigates approaches to optimize solutions of distributed consensus to the properties of embedded systems. We discuss alternatives that allow constructing better practical solutions in realistic environments. For example, many networked embedded systems are equipped with both ad-hoc commun...

Secure User Authentication Mechanism in Digital Home Network Environments

The home network is a new IT technology environment for making an offer of convenient, safe, pleasant, and blessed lives to people, making it possible to be provided with various home network services by constructing home network infrastructure regardless of devices, time, and places. This can be done by connecting home devices based on wire and wireless communication networks, such as mobile c...

Reconfigurable Scientific Applications on GRID Services

This paper proposes a runtime environment for dynamically changing, parallel scientific applications. This kind of application is motivated by the LOFAR/LOIS project, which aims at a multidisciplinary research platform for natural scientists and engineers. The dynamic infrastructure in turn is then mapped to Grid Services environments.

JECho: Supporting Distributed High Performance Applications with Java Event Channels

This paper presents JECho, a Java-based communication infrastructure for collaborative high performance applications. JECho implements a publish/subscribe communication paradigm, permitting distributed, concurrently executing sets of components to provide interactive service to collaborating end users via event channels. JECho’s efficient implementation enables it to move events at rates higher...

Publication date: 2009